Abstract

This paper presents an algorithm for learning a 
highly redundant inverse model in continuous and 
non-preset environments. Our Socially Guided In- 
trinsic Motivation by Demonstrations (SGIM-D) al- 
gorithm combines the advantages of both social 
learning and intrinsic motivation, to specialise in a 
wide range of skills, while lessening its dependence 
on the teacher. SGIM-D is evaluated on a fishing 
skill learning experiment. 

1 Approaches for Adaptive Personal Robots 

The promise of personal robots operating in human environ- 
ments to interact with people on a daily basis points out the 
importance of adaptivity of the machine to its changing and 
unlimited environment, to match its behaviour and learn new 
skills and knowledge as the users' needs change. 

In order to learn an open-ended repertoire of skills,
developmental robots, like animal or human infants, need to be
endowed with task-independent mechanisms to explore new
activities and new situations [Weng et al., 2001; Asada et al.,
2009]. The set of skills that could be learnt is infinite and
cannot be learnt completely within a lifetime. Thus, deciding
how to explore and what to learn becomes crucial. Exploration
strategies of recent years can be classified into two families:
1) socially guided exploration; 2) internally guided exploration,
and in particular intrinsically motivated exploration.

1.1 Socially Guided Exploration 

To build a robot that can learn and adapt to a human environment,
the most straightforward way might be to transfer knowledge about
tasks or skills from a human to a machine. Several works
incorporate human input into a machine learning process, for
instance through human guidance to learn by demonstration
[Chernova and Veloso, 2009; Lopes et al., 2009; Cederborg et al.,
2010; Calinon, 2009] or by physical guidance [Calinon et al.,
2007], through human control of the reinforcement learning reward
[Blumberg et al., 2002; Kaplan et al., 2002], through human advice
[Clouse and Utgoff, 1992], or through human tele-operation during
training [Smart and Kaelbling, 2002]. However, high dependence on
human teaching is limited because of human patience, ambiguous
human input, the correspondence problem [Nehaniv and Dautenhahn,
2007], etc. Increasing the learner's autonomy from human guidance
could address these limitations. This is the case of internally
guided exploration methods.

1.2 Intrinsically Motivated Exploration

Intrinsic motivation, an example of internally guided exploration,
has drawn attention recently, especially for open-ended cumulative
learning of skills [Weng et al., 2001; Lopes and Oudeyer, 2010].
The term intrinsic motivation in psychology describes the
attraction of humans toward different activities for the pleasure
they experience intrinsically. This is crucial for autonomous
learning and discovery of new capabilities [Ryan and Deci, 2000;
Deci and Ryan, 1985; Oudeyer and Kaplan, 2008]. This inspired the
creation of fully autonomous robots [Barto et al., 2004; Oudeyer
et al., 2007; Baranes and Oudeyer, 2009; Schmidhuber, 2010;
Schembri et al., 2007] with meta-exploration mechanisms monitoring
the evolution of learning performances of the robot, in order to
maximise informational gain, and with heuristics defining the
notion of interest [Fedorov, 1972; Cohn et al., 1996; Roy and
McCallum, 2001].

Nevertheless, most intrinsic motivation approaches address
only partially the challenges of unlearnability and unboundedness
[Oudeyer et al., to appear]. As interestingness is based
on the derivative of the evolution of performance of acquired 
knowledge or skills, computing measures of interest requires 
a level of sampling density that decreases the efficiency as 
the level of sampling grows. Even in bounded spaces, the 
measures of interest, mostly non-stationary regressions, face 
the curse of dimensionality [Bishop, 2007]. Thus, without
additional mechanisms, the identification of learnable zones 
where knowledge and competence can progress, becomes in- 
efficient. The second limit relates to unboundedness. If the 
measure of interest depends only on the evaluation of perfor- 
mances of predictive models or of skills, it is impossible to 
explore/sample inside all localities in a life time. Therefore, 
complementary mechanisms have to be introduced in order 
to constrain the growth of the size and complexity of practi- 
cally explorable spaces and allow the organism to introduce 
self-limits in the unbounded world and/or drive them rapidly 
toward learnable subspaces. Among constraining processes 
are motor synergies, morphological computation, maturational 
constraints as well as social guidance. 

1.3 Combining Internally Guided Exploration and 
Socially Guided Exploration 

Intrinsic motivation and socially guided learning are
traditionally opposed, yet they strongly interact in the daily
life of humans. Each approach has its own limits, but combining
them could overcome those limits.

Social guidance can drive a learner into new intrinsically 
motivating spaces or activities which it may continue to ex- 
plore alone for their own sake, but which might have been 
discovered only thanks to social guidance. Robots may ac- 
quire new strategies for achieving those intrinsically motivated 
activities by external observation or advice. Reinforcement
learning can let the human directly control the actions of a
robot agent with teleoperation to provide example task
demonstrations [Peters and Schaal, 2008; Kormushev et al., 2010],
which initialise the learning process by imitation learning and
subsequently improve the policy by reinforcement learning.
Nevertheless, the role of the teacher here is restricted to the 
initialisation phase. Moreover, these works aim at one partic- 
ular preset task, and do not explore the whole world. 

Conversely, as learning that depends highly on the teacher
quickly discourages the user from teaching the robot, integrating
self-exploration into social learning methods could relieve the
user from overly time-consuming teaching. Moreover, while
self-exploration tends to result in a broader task repertoire,
guided exploration with a human teacher tends to be more
specialised, resulting in fewer tasks that are learnt faster.
Combining both can thus bring out a system that acquires a wide
range of knowledge, which is necessary to scaffold future
learning with a human teacher on specifically needed tasks.

Initial work in this direction has been presented in [Thomaz and
Breazeal, 2008; Thomaz, 2006], where Socially Guided Exploration's
motivational drives, along with social scaffolding from a human
partner, bias the behaviour to create learning opportunities for
a hierarchical Reinforcement Learning mechanism. However, the
representation of the continuous environment by the robot is
discrete and the setup is a limited and preset world, with few
primitive actions possible.

We would like to address the learning in the case of an un- 
bounded, non-preset and continuous environment. This pa- 
per introduces SGIM (Socially Guided Intrinsic Motivation), 
an algorithm to deal with such spaces, by merging socially 
guided exploration and intrinsic motivation. The next section 
describes SGIM's intrinsic motivation part before its social in- 
teraction part. Then, we present the fishing experiment and its 
results. 

2 Intrinsic Motivation : the SAGG-RIAC 
Algorithm 

In this section we introduce Self-Adaptive Goal Generation -
Robust Intelligent Adaptive Curiosity (SAGG-RIAC), an
implementation of competence-based intrinsic motivations [Baranes
and Oudeyer, 2010]. We chose this algorithm as the intrinsic
motivation part of SGIM for its efficiency in learning
a wide range of skills in high-dimensional space including 
both easy and unlearnable subparts. Moreover, its goal di- 
rectedness allows bidirectional merging with socially guided 
methods based on feedback on either goal and/or means. Its 
ability to detect unreachable spaces also makes it suitable for 
unbounded spaces. 

2.1 Formalisation of the Problem 

Let us consider a robotic system whose configurations/states
are described in both a state space $X$ (e.g. actuator space)
and an operational/task space $Y$. For given configurations
$(x_1, y_1) \in X \times Y$, an action $a \in A$ allows a
transition towards the new states $(x_2, y_2) \in X \times Y$.
We define the action $a$ as a parameterised dynamic motor
primitive. While in classical reinforcement learning problems
$a$ is usually defined as a sequence of micro-actions,
parameterised motor primitives consist in complex closed-loop
dynamical policies which are actually temporally extended
macro-actions: they include at the low level long sequences of
micro-actions, but have the advantage of being controlled at the
high level only through the setting of a few parameters. The
association $M : (x_1, y_1, a) \mapsto (x_2, y_2)$ corresponds
to a learning exemplar that will be memorised, and the goal of
our system is to learn both the forward and inverse models of
the mapping $M$. We can also describe the learning in terms of
tasks, and consider $y_2$ as a goal which the system reaches
through the means $a$ in a given context $(x_1, y_1)$. In the
following, both points of view will be used interchangeably.

2.2 Global Architecture of SAGG-RIAC 

SAGG-RIAC is a multi-level active learning algorithm and 
consists in pushing the robot to perform babbling in the goal 
space by self-generating goals which provide a maximal com- 
petence improvement for reaching those goals. Once a goal is 
set, a lower level active motor learning algorithm locally ex- 
plores how to reach the given self-generated goal. The SAGG- 
RIAC architecture is organised into 2 levels : 

• A higher level of active learning which decides what to
learn, sets a goal $y_g \in Y$ depending on the level of
achievement of previous goals, and learns at a longer time
scale.

• A lower level of active learning that attempts to reach the
goal $y_g$ set by the higher level and learns at a shorter time
scale.

2.3 Lower Level Learning 

The lower level is made of 2 modules. The Goal Directed
Low-Level Interest Computation module guides the system toward
the goal $y_g$ and creates a model of the world that may
be reused afterwards for other goals. The Goal Directed
Low-Level Actions Interest Computation module measures the
interest level of the goal $y_g$ by $Sim$, a function
representing the similarity between the final state $y_f$ of the
reaching attempt and the actual goal $y_g$. The exact definition
depends on the specific learning task, but $Sim$ is to be defined
in $[-\infty, 0]$, such that the higher $Sim(y_g, y_f, p)$, the
more efficient the reaching attempt is.

2.4 Higher Level Learning 

The two modules of the higher level compute the interesting 
goals to explore, depending on the performance of the short- 
term level and the previous goals already explored. 

The Goal Interest Computation module relies on the feedback of
the lower level to map the interest level in the task space $Y$.
The interest $interest_i$ of a region $R_i \subset Y$ is the
local competence progress, over a sliding time window of the
$\zeta$ most recent goals attempted inside $R_i$:

[Equation: definition of $interest_i$, missing from the source.]

where $\{y_1, y_2, \ldots, y_k\}_{R_i}$ are elements of $R_i$
indexed by their relative time order of experimentation and
$\gamma_{y_j}$ is the competence of $y_j \in R_i$, defined with
respect to the similarity between the final state $y_f$ of the
reaching attempt and the actual goal $y_j$:

[Equation: piecewise definition of $\gamma_{y_j}$, equal to
$Sim(y_j, y_f, p)$ when the similarity is below a threshold; the
threshold and the value in the "otherwise" case are not
recoverable from the source.]

The Goal Self-Generation module uses the measure of interest
to split $Y$ into subspaces to maximally discriminate areas
according to their levels of interest and select the region where
future goals will be chosen.
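
As an illustration of how a measure of local competence progress
could drive region selection, here is a minimal Python sketch. It
is not the SAGG-RIAC implementation: the progress measure (absolute
difference between the mean competence of the older half and of the
newer half of the window) and the function names `region_interest`
and `most_interesting_region` are assumptions made for this example.

```python
def region_interest(competences, window):
    """Competence progress over the `window` most recent goals of a region.

    `competences` is the time-ordered list of competence values of the
    goals attempted in the region. Assumption: progress is the absolute
    difference between the mean competence of the older half and of the
    newer half of the window.
    """
    recent = competences[-window:]
    if len(recent) < 2:
        return 0.0
    half = len(recent) // 2
    older, newer = recent[:half], recent[half:]
    return abs(sum(newer) / len(newer) - sum(older) / len(older))


def most_interesting_region(regions, window):
    """Return the key of the region with the highest interest.

    `regions` maps a region name to its time-ordered competence list.
    """
    return max(regions, key=lambda name: region_interest(regions[name], window))
```

A region whose competence is still changing gets a higher interest
than a region whose competence has stalled, whether it stalled
because the goals are mastered or because they are unreachable.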

The goal self-generation mechanism involves random ex- 
ploration of the space in order to map the level of interest for 
the different subparts. This prevents it from exploring effi- 
ciently large goal spaces containing small reachable subparts 
because of the need for discrimination of these subparts from 
unreachable ones. In order to solve this problem, we propose 
to bootstrap intrinsic motivation with social guidance. In the
following section, we review different kinds of social
interaction modes, then describe our algorithm SGIM-D (Socially
Guided Intrinsic Motivation by Demonstration).

3 SGIM Algorithm 

3.1 Formalisation of the Social Interaction 

Within the problem of learning the forward and the inverse
models of the mapping $M : (x_1, y_1, a) \mapsto (x_2, y_2)$, we
would like to introduce the role of a human teacher to boost the
learning of the means $a$ and goal $y_2$ in the contexts
$(x_1, y_1)$, and set a formalisation of the case where an
imitator is trying to build good world models and where paying
attention to the demonstrator is one strategy for speeding up
this learning. Given the model estimated by the robot $M_R$, and
by the human teacher $M_H$, we can consider social interaction as
a transformation $SocInter : (M_R, M_H) \mapsto (M2_R, M2_H)$.
The goal of the learning is that the robot acquires a perfect
model of the world, i.e. that
$SocInter(M_R, M_H) = (M_{perfect}, M_{perfect})$. Social
interaction is a combination of the human teacher's behaviour or
guidance $SocInter_H$ and the machine learner's behaviour
$SocInter_R$. We presume a transparent communication between the
teacher and the learner, i.e. the teacher can access the real
visible state of the robot as a noiseless function of its
internal state $visible_R(M_R)$. Let us note
$visible_R^{perfect}$ the "perfect visible state" of the robot,
i.e. the value of the visible states of the robot when its
estimation of the model is perfect: $M_R = M_{perfect}$.
Moreover, we postulate that the teacher is omniscient: his
estimation of the model is the perfect model,
$M_H = M_{perfect}$. Therefore, our social interaction is a
transformation $SocInter : M_R \mapsto M2_R$.

In order to define the social interaction that we wish to con- 
sider, we need to peruse the different possibilities. 

3.2 Analysis of Social Interaction Modes 

First of all, let us define which type of interaction takes place,
and what role we shall give to the teacher. Taking inspiration
from psychology, such as the use of motherese in child
development [Breazeal and Aryananda, 2002] or the importance of
positive feedback [Thomaz and Breazeal, 2008], reward-like
feedback seems to be important in learning. Such feedback
typically provides an estimation of a distance between the
robot's visible state and its "perfect visible state":
$SocInter_H \sim dist(visible_R, visible_R^{perfect})$. Yet, this
cheering needs to be completed by games where parents show and
instruct children interesting cases and help children reach their
goals. Therefore, we prefer a demonstration type of interaction.
Besides, social interaction can be separated into two broad
categories of social learning [Call and Carpenter, 2002]:
imitation, where the learner copies the specific motor patterns
$a$, and emulation, where the learner attempts to replicate goal
states $y_2 \in Y$. To enable both imitation and emulation and
influence the learner both from the action and goal point of
view, we provide the learner with both a means and a goal
example: $SocInter_H \in A \times Y$. Indeed, the teacher who
shows both a means and a goal offers the best opportunity for the
learner to progress, for the learner can use both the
means-driven and the goal-driven approach.

Figure 1: Structure of SGIM-D (Socially Guided Intrinsic
Motivation by Demonstration). SGIM-D is organised into 2 levels.

Our next question is: when should the interaction occur? 
For the robot's adaptability or flexibility to the changing en- 
vironment and demand from the user, interactions should take 
place throughout the learning process. In order to test the effi- 
ciency of our algorithm and control the way interactions occur, 
we choose to trigger the interaction at a constant frequency. 

Lastly, to induce significant improvement of the learner,
we shall provide him with demonstrations in a not yet learned
subspace, in order to make the robot explore new goals and
unexplored subspaces.

So as to bootstrap a system endowed with intrinsic motivation,
we choose learning by demonstration of means and goals, where
the teacher introduces at a regular pace a random demonstration
among the unreached goals for SGIM-D.

3.3 Description of SGIM-D Algorithm 

This section details how SGIM learns an inverse model in a 
continuous, unbounded and non-preset framework, combining 
both intrinsic motivation and social interaction. Our Socially 
Guided Intrinsic Motivation algorithm merges SAGG-RIAC 
as intrinsic motivation, with a learning by demonstration, as 
social interaction. SGIM-D includes two different levels of
learning (fig. 1).

Higher Level Learning 

The higher level of active learning decides which goal
$(x_2, y_2)$ is interesting to explore and contains 3 modules.
The Goal Self-Generation module and the Goal Interest Computation
module are as in SAGG-RIAC. The Social Interaction module
manages the interaction with the human teacher. It interfaces
the social guidance of the human teacher $SocInter_H$ with the
Goal Interest Computation module and interrupts the intrinsic
motivation at every demonstration by the teacher. It first
triggers an emulation effect, as it registers the demonstration
$(a_{demo}, y_{demo})$ in the memory of the system and gives it
as input to the Goal Interest Computation module. It also
triggers the imitation behaviour and sends the demonstrated
action $a_{demo}$ to the Imitation module of the lower level.

Lower Level Learning 

The lower level of active learning also contains 3 modules.
The Goal Directed Exploration and Learning module and the
Goal Directed Low-Level Actions Interest Computation module,
as in SAGG-RIAC, use $M_H$ to reach the self-generated
goal $(x_2, y_2)$. The Imitation module interfaces with the
high-level Social Interaction module. It tries small variations
to explore in the locality of $a_{demo}$ and helps address the
correspondence problem in the case of a human demonstration which
does not use the same parametrisation as the robot.

Figure 2: Fishing experimental setup. Our 6-dof robot arm
manipulates a fishing rod. Each joint is controlled by a Bezier
curve parameterised by 4 scalars (initial, middle and final joint
position and a duration). We track the position of the hook when
it reaches the water surface.

The above description is detailed for our choice of SGIM
by Demonstration. Such a structure remains suitable for other
choices of social interaction modes: we only have to change
the content of the Social Interaction module, and change the
Imitation module to the chosen behaviour. Notably, our structure
can deal with cases where the intrinsically motivated part gives
feedback to the teacher, as the Goal Interest Computation module
and the Social Interaction module communicate bilaterally. For
instance, the case where the learner asks the teacher for
demonstrations can still use this structure.

We have hitherto presented the intrinsic motivation algorithm
SAGG-RIAC and analysed social learning and its different modes,
in order to design Socially Guided Intrinsic Motivation by
Demonstration (SGIM-D), which merges both paradigms to learn a
model in a continuous, unbounded and non-preset framework. In the
following section we use SGIM-D to learn a fishing skill.

4 Fishing Experiment 

This fishing experiment focuses on the learning of inverse
models in a continuous space, and deals with a very
high-dimensional and redundant model. The model of a fishing rod
in a simulator might be mathematically computed, but the dynamics
of a real-world fishing rod would be impossible to model.
A system that learns such a model is therefore of interest.

4.1 Experimental Setup 

Our continuous environment features a 6 degrees-of-freedom robot
arm that learns how to use a fishing rod (fig. 2), i.e. which
action $a$ to perform for a given goal position $y_g$ that the
hook should reach when falling into the water.

In our experiment, $X$ describes the actuator/joint positions
and the state of the fishing rod. $Y$ is a 2-D space that
describes the position of the hook when it reaches the water. The
robot always starts with the same initial position: $x_1$ and
$y_1$ always take the same values $x_{org}$ and $y_{org}$.
Variable $a$ describes the parameters of the commands for the
joints. We choose to control each joint with a Bezier curve
defined by 4 scalars (initial, middle and final joint position
and a duration). Therefore an action is represented by 24
parameters: $a = (a^1, a^2, \ldots, a^{24})$ defines the points
$x = (x^1, x^2, \ldots, x^6)$ of the Bezier curve and then the
joint positions, made physically consistent, which the robot
attains sequentially. Because our experiment uses at each trial
the same context $(x_{org}, y_{org})$, our system memorises after
executing every action $a$ only the association $(a, y_2)$ and
learns the context-free association $M : a \mapsto y_2$.
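
To make the action encoding concrete, the following Python sketch
splits a 24-parameter action into six per-joint tuples and
evaluates one joint trajectory. The text does not specify the
parameter ordering or the degree of the Bezier curve, so both are
assumptions here: parameters are taken joint by joint as (initial,
middle, final, duration), and the curve is a quadratic Bezier with
those three positions as control points. The names `split_action`
and `joint_position` are hypothetical.

```python
def split_action(a):
    """Split a 24-parameter action into 6 per-joint tuples
    (initial, middle, final, duration). The ordering is an assumption."""
    if len(a) != 24:
        raise ValueError("expected 24 parameters (6 joints x 4 scalars)")
    return [tuple(a[4 * j: 4 * j + 4]) for j in range(6)]


def joint_position(params, t):
    """Quadratic Bezier with control points (initial, middle, final),
    evaluated at time t; the final position is held once t >= duration."""
    initial, middle, final, duration = params
    # Normalised curve parameter s in [0, 1].
    s = min(max(t / duration, 0.0), 1.0) if duration > 0 else 1.0
    return (1 - s) ** 2 * initial + 2 * (1 - s) * s * middle + s ** 2 * final
```

Note that a quadratic Bezier curve starts at the initial position
and ends at the final position, but is only attracted towards the
middle control point without necessarily passing through it.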

The experimental scenario sets the robot to explore the
task space through intrinsic motivation when it is not
interrupted by the teacher. After $P$ movements, the teacher
interrupts whatever the robot is doing, and gives him an
example $(a_{demo}, y_{demo})$. The robot first registers that
example in its memory as if it were its own. Then, the Imitation
module tries to imitate the teacher with movement parameters
$a_{imitate} = a_{demo} + a_{rand}$, where $a_{rand}$ is a random
movement parameter variation such that $|a_{rand}| < \epsilon$.
At the end of the imitation phase, SAGG-RIAC resumes the
autonomous exploration, taking into account the new set of
experience. We hereafter describe the low-level exploration,
specific to this problem.
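
The imitation step and the teacher's schedule can be sketched in
Python as follows. This is only an illustration under stated
assumptions: the bound on the random variation is applied per
component (the norm is not specified in the text), and the names
`imitate` and `is_demonstration_step` are hypothetical.

```python
import random


def imitate(a_demo, eps, rng=random):
    """Return a_imitate = a_demo + a_rand.

    Assumption: each component of a_rand is drawn uniformly in
    [-eps, eps), i.e. the bound is applied per component.
    """
    return [a + eps * (2.0 * rng.random() - 1.0) for a in a_demo]


def is_demonstration_step(movement_index, period):
    """True when the teacher interrupts: once every `period` movements."""
    return movement_index > 0 and movement_index % period == 0
```

Passing a seeded `random.Random` instance as `rng` makes the
variations reproducible across runs.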

4.2 Empirical Implementation of the Low-Level 
Exploration 

Let us first consider that the robot learns to reach a fixed goal
position $y_g = (y_g^1, y_g^2)$. We first have to define the
similarity function $Sim$ with respect to the Euclidean distance
$D$:

$$Sim(y_g, y_f, y_{org}) =
\begin{cases}
-1 & \text{if } \dfrac{D(y_g, y_f)}{D(y_g, y_{org})} > 1 \\
-\dfrac{D(y_g, y_f)}{D(y_g, y_{org})} & \text{otherwise}
\end{cases}$$
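
Reading the similarity as minus the ratio of the final distance to
the goal over the initial distance to the goal, saturated at -1
when that ratio exceeds 1, a direct Python transcription could look
like the sketch below. The function name `sim` is chosen for this
example, and the case where the goal coincides with the starting
position (zero denominator) is deliberately left unhandled.

```python
import math


def sim(y_g, y_f, y_org):
    """Similarity in [-1, 0]: 0 when the goal is reached exactly, -1 when
    the attempt ends farther from the goal than the starting position."""
    ratio = math.dist(y_g, y_f) / math.dist(y_g, y_org)
    return -1.0 if ratio > 1.0 else -ratio
```

With this reading, an attempt that halves the initial distance to
the goal scores -0.5.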



To learn the inverse model $InvModel : y \mapsto a$, we use the
following optimisation mechanism, which can be divided into
an exploitation regime and an exploration regime.

Exploitation Regime 

The exploitation regime uses the memory to locally interpolate
an inverse model. Given the high redundancy of the problem, we
chose a local approach and extract the most reliable data by
computing the set $L$ of the $l_{max}$ nearest neighbours of
$y_g$ and their corresponding movement parameters using an ANN
method [Muja and Lowe, 2009] which is based on a tree split using
the k-means process:
$L = \{(y,a)_1, (y,a)_2, \ldots, (y,a)_{l_{max}}\} \subset (Y \times A)^{l_{max}}$.

Then, for each element $(y,a)_l \in L$, we compute its
reliability. Let $K_l$ be the set of the $k_{max}$ nearest
neighbours of $a_l$ chosen from the full dataset:
$K_l = \{(y,a)_1, (y,a)_2, \ldots, (y,a)_{k_{max}}\}$, and let
$var_l$ be the variance of $K_l$. As the reliability of a
movement depends both on the local knowledge and its
reproducibility, we define the reliability of $(y,a)_l \in L$ as
$dist(y_l, y_g) + \alpha \times var_l$, where $\alpha$ is a
constant ($\alpha = 0.5$ in our experiment). We choose among $L$
the element with the smallest value as the most reliable set
$(y,a)_{best}$.

In the locality of the set $(y,a)_{best}$, we interpolate using
the $k_{max}$ elements of $K_{best}$ to compute the action
corresponding to $y_g$:
$a_g = \sum_{k=1}^{k_{max}} coef_k \, a_k$, where
$coef_k \sim Gaussian(dist(y_k, y_g))$ is a normalised Gaussian.
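
The three steps of the exploitation regime (candidate selection,
reliability scoring, Gaussian-weighted interpolation) can be
sketched in plain Python as below. Several choices are assumptions
made for illustration and are not taken from the paper: a
brute-force nearest-neighbour search stands in for the approximate
ANN method, the variance of $K_l$ is taken over the reached
positions of its elements, and the Gaussian width `sigma` is a
hypothetical parameter.

```python
import math


def nearest(points, query, k, key):
    """Brute-force k nearest neighbours (stand-in for the ANN search)."""
    return sorted(points, key=lambda p: math.dist(key(p), query))[:k]


def variance(samples):
    """Mean squared distance of the samples to their centroid."""
    n = len(samples)
    mean = [sum(c) / n for c in zip(*samples)]
    return sum(math.dist(s, mean) ** 2 for s in samples) / n


def exploit(memory, y_g, l_max, k_max, alpha=0.5, sigma=0.1):
    """memory: list of (y, a) pairs. Returns the interpolated action a_g."""
    # Step 1: the l_max stored outcomes closest to the goal.
    candidates = nearest(memory, y_g, l_max, key=lambda ya: ya[0])
    # Step 2: score each candidate by dist(y_l, y_g) + alpha * var_l,
    # where var_l is computed on the k_max nearest actions to a_l.
    best_score, best_k = None, None
    for y_l, a_l in candidates:
        k_l = nearest(memory, a_l, k_max, key=lambda ya: ya[1])
        score = math.dist(y_l, y_g) + alpha * variance([y for y, _ in k_l])
        if best_score is None or score < best_score:
            best_score, best_k = score, k_l
    # Step 3: Gaussian-weighted average of the actions in K_best.
    weights = [math.exp(-math.dist(y, y_g) ** 2 / (2 * sigma ** 2))
               for y, _ in best_k]
    total = sum(weights)
    if total == 0.0:  # all weights underflowed: fall back to a plain mean
        weights, total = [1.0] * len(best_k), float(len(best_k))
    dim = len(best_k[0][1])
    return [sum(w * a[i] for w, (_, a) in zip(weights, best_k)) / total
            for i in range(dim)]
```

For example, with a memory in which every action reaches a position
equal to itself, `exploit` returns an action close to the requested
goal.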

Exploration Regime 

The system just uses a random movement parameter to explore
the space. It continuously estimates the distance between the
goal $y_g$ and the closest already reached position $y_c$,
$dist(y_c, y_g)$. The system has a probability proportional to
$dist(y_c, y_g)$ of being in the exploration regime, and the
complementary probability of being in the exploitation regime.
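
A minimal Python sketch of this regime selection follows. The
proportionality constant is not given in the text, so the sketch
assumes a hypothetical normalising distance `d_max` and clips the
probability to [0, 1]; the function names are also chosen for this
example.

```python
import math
import random


def exploration_probability(reached, y_g, d_max):
    """Probability of exploring: proportional to the distance between the
    goal and the closest reached position, clipped to [0, 1].
    `d_max` is a hypothetical normalising constant."""
    if not reached:
        return 1.0  # nothing reached yet: always explore
    closest = min(math.dist(y, y_g) for y in reached)
    return min(1.0, closest / d_max)


def choose_regime(reached, y_g, d_max, rng=random):
    """Draw the regime for the next movement."""
    p = exploration_probability(reached, y_g, d_max)
    return "exploration" if rng.random() < p else "exploitation"
```

The closer the memory already gets to the goal, the more often the
system exploits its local inverse model instead of trying random
movement parameters.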

4.3 Simulations 

The experimental setup has been designed for a human
teacher. Nevertheless, to test our algorithm, to better control
the demonstrations of the teacher, to be able to run statistics,
and as a starting point, we used the V-REP physical simulator
with the ODE physics engine, which updates every 50 ms. The noise
of the control system of the 3D robot is estimated at 0.073
for 10 attempts of 20 random movement parameters, while the
reachable area spans between -1 and 1 for each dimension.
Per experiment, we ran 5000 movements and assessed the
performance on a 129-point benchmark set (fig. 5) every 250
movements. After several runs of random explorations and
SAGG-RIAC, we determined the apparent reachable space as
the set of all the reached points in the goal/task space, which
makes up some 70 000 points. We then divided the space into
small squares, and generated a point randomly in each square.
Using a 26 x 16 grid, we obtained a set of 129 goal points
in the task space, representative of the reachable space, and
independent of the experiment data used.

Figure 3: Maps of the benchmark points used to assess the
performance of the robot, and the teaching set, used in SGIM.

Figure 4: Histograms of the positions explored by the fishing rod
inside the 2D goal space $(y^1, y^2)$. Each row shows the
timeline of the cumulated set of points throughout 5000 random
movements, and represents a different learning algorithm: random
input parameters, SAGG-RIAC and SGIM-D.
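
The construction of the benchmark set can be illustrated with the
Python sketch below. It assumes that a benchmark point is drawn
only in grid squares that contain at least one reached point, which
would explain why a 26 x 16 grid yields fewer points than squares;
this reading, the `bounds` argument and the function name
`benchmark_points` are assumptions made for the example.

```python
import random


def benchmark_points(reached, n_cols, n_rows, bounds, rng=random):
    """One random goal per grid square containing at least one reached point.

    bounds = (x_min, x_max, y_min, y_max); reached points are assumed to
    lie inside these bounds.
    """
    x_min, x_max, y_min, y_max = bounds
    w = (x_max - x_min) / n_cols
    h = (y_max - y_min) / n_rows
    occupied = set()
    for x, y in reached:
        col = min(int((x - x_min) / w), n_cols - 1)
        row = min(int((y - y_min) / h), n_rows - 1)
        occupied.add((col, row))
    # Draw the benchmark point uniformly inside each occupied square,
    # independently of where the reached points fell within it.
    return [(x_min + (c + rng.random()) * w, y_min + (r + rng.random()) * h)
            for c, r in sorted(occupied)]
```

Because each benchmark point is drawn afresh inside its square, the
resulting set does not reuse any of the reached points themselves.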

Likewise, we prepared a teaching set of 27 demonstrations
(fig. 3) and defined in the reachable space small squares
$subY$. For each $subY$, we choose a demonstration $(a, y)$
so that $y \in subY$. So that the teacher gives the most useful
demonstration, we compute
$M_H^{-1}(subY) = \{a \mid M_H : a \mapsto y \in subY\}$. We
tested all the movement parameters $a \in M_H^{-1}(subY)$ to
choose the most reliable one $a_{demo}$, i.e. the movement
parameters that resulted in the smallest variance in the goal
space:
$a_{demo} = \arg\min_{a \in M_H^{-1}(subY)} var(M_H(a))$.
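
The selection of the most reliable demonstration for one square can
be sketched in Python as follows. The sketch assumes that each
candidate action has been executed several times and that an action
belongs to the candidate set only if all its outcomes fall in the
square; these choices, the `trials` data layout and the function
name `pick_demonstration` are hypothetical.

```python
def pick_demonstration(trials, in_sub_y):
    """Return the action with the smallest outcome variance among those
    whose outcomes all fall in the target square, or None if there is none.

    trials: dict mapping an action (tuple) to the list of positions it
    reached over repeated executions.
    in_sub_y: predicate telling whether a position lies in the square.
    """
    def variance(samples):
        n = len(samples)
        mean = [sum(c) / n for c in zip(*samples)]
        return sum(sum((s_i - m_i) ** 2 for s_i, m_i in zip(s, mean))
                   for s in samples) / n

    candidates = {a: ys for a, ys in trials.items()
                  if all(in_sub_y(y) for y in ys)}
    if not candidates:
        return None
    return min(candidates, key=lambda a: variance(candidates[a]))
```

Choosing the lowest-variance action favours demonstrations that the
robot can reproduce despite the noise of its control system.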

4.4 Experimental results 
A Wide Range of Skills 

We ran the experiment in the same conditions but with different
learning algorithms, and plotted in fig. 4 the histogram
of the positions of the fishing rod when it reaches the water
surface. The first row of fig. 4 shows that a natural position
lies around (0.5, 0) in the case of an exploration with random
movement parameters. Most movement parameters map to
a position of the hook around that central position. The second
row of fig. 4 shows the histogram in the task space of
the explored points under the SAGG-RIAC algorithm throughout
different timeframes. Compared to a random parameter
exploration, SAGG-RIAC has increased the explored space and,
most of all, explores the explorable space more uniformly. The
regions of interest change through time as the system finds
new interesting subspaces to explore. Intrinsic motivation
exploration results in a wider repertoire for the robot. Besides,
fig. 4 highlights a region around (-0.5, -0.25) that was
ignored by both the random exploration and SAGG-RIAC, but
was well explored by SGIM-D. This isolated subspace corresponds
to a tiny subspace in the parameter space, seldom explored by
the random exploration or seen by SAGG-RIAC, which was focusing
on the subspaces around the places it had already explored. On
the contrary, in SGIM, the teacher gives a demonstration that
brings new competence to the robot, and triggers the system's
interest to define the area as interesting.

Figure 5: Evaluation of the performance of the robot under the
learning algorithms SAGG-RIAC and SGIM-D, when the task space is
small or 20 times larger. We plotted the mean distance to the
benchmark points over several runs of the experiment.

Figure 6: Histograms of the goals set by the Goal Self-Generation
module when the task space is large. The different figures
correspond to the results for different runs of the experiment
with the SAGG-RIAC algorithm (1st row) and the SGIM-D algorithm
(2nd row). The figures in both rows have been zoomed and centred
on the reachable space.

Precision 

To assess the precision of its learning, and compare its
performance in large spaces, we plotted the performance of
SAGG-RIAC, of SGIM-D, and of a learner performing only
variations of the teacher's demonstrations (with the same number
of demonstrations as with SGIM-D). Fig. 5 shows the mean error
for the benchmark in the case of a task space bounded close to
the reachable space, and when we multiplied the size by 20. In
the case of the small task space, the plots show that SGIM-D
performs better than SAGG-RIAC or the learning by demonstrations
alone. As expected, performance decreases when the size of the
task space increases (cf. section 1). However, it improves with
SGIM-D, and the difference between SAGG-RIAC and SGIM-D is larger
in the case of a large task space: the improvement is most
significant when the task space size increases.

Identification of the reachable space 

This difference in performance is explained by fig. 6,
which plots the histogram of the set of the self-generated goals
and the subspaces explored by the robot. We can see that in
the second row, most goals are within the reachable space, and
mostly cover the reachable space. This means that SGIM-D
could differentiate the reachable subspaces from the unreachable
subspaces. On the contrary, the first row shows goal
points that appear disorganised: SAGG-RIAC has not identified
which subspaces are unreachable. Demonstrations given
by the teacher improved the learner's knowledge of the inverse
model $InvModel$. We also note that the demonstrations occurred
only once every 150 movements, meaning that even
a slight presence of the teacher can significantly improve the
performance of the autonomous exploration. In conclusion,
SGIM-D improves the precision of the system with little
intervention from the teacher, and helps point out key subregions
to be explored. The role of SGIM-D is most significant when
the size of the task space increases.

5 Conclusion and Future Work 

Our experiment shows that SGIM learns a model of its envi- 
ronment, and little intervention from the teacher can improve 
its learning compared to demonstrations alone or SAGG- 
RIAC, especially in the case of a large task space. Even though 
the teacher is not omniscient, he can transfer his knowledge to 
the learner and bootstrap autonomous exploration. 

Nevertheless, in this initial validation study in simulation,
we made strong assumptions about the teacher. He has the
same motion generation rules as the robot, and is omniscient
so that he teaches the robot the reachable space. A study of a
non-omniscient teacher should show how his demonstrations
bias the subspaces explored by the robot. Experiments with
human demonstrations need to be conducted to address the
problems of correspondence and of a biased teacher. Despite these
assumptions, our simulation indicates that SGIM-D successfully
combines learning by demonstration and autonomous exploration
even in an experimental setup as complex as one with
a continuous 24-dimensional action space.

This paper introduces Socially Guided Intrinsic Motiva- 
tion by Demonstration, a learning algorithm for models in 
a continuous, unbounded and non-preset environment, which 
efficiently combines social learning and intrinsic motivation. 
It proposes a hierarchical learning with a higher level that
determines which goals are interesting either through intrinsic
motivation or social interaction, and a lower-level learning that
endeavours to reach them. Our framework takes advantage of the
demonstrations of the teacher to explore unknown subspaces,
to gain precision, and to efficiently distinguish the reachable
space from the unreachable space even in large task spaces,
thanks to the knowledge transfer from the teacher to the learner.
It also
takes advantage of the autonomous exploration to improve its 
performance in a wide range of tasks in the teacher's absence. 

In our experiment, the robot imitates the teacher for a fixed 
duration before returning to emulation mode where SGIM-D 
takes into account the goal of this new data. However, future 
work on a more natural and autonomous algorithm to switch 
between imitation and emulation could improve the efficiency 
of the system. 